34 research outputs found

    Using migratable objects to enhance fault tolerance schemes in supercomputers

    Supercomputers have seen an exponential increase in their size in the last two decades. Such a high growth rate is expected to take us to exascale in the timeframe 2018-2022. But, to bring about a productive exascale environment, it is necessary to focus on several key challenges. One of those challenges is fault tolerance. Machines at extreme scale will experience frequent failures and will require the system to avoid or overcome those failures. Various techniques have recently been developed to tolerate failures. The impact of these techniques and their scalability can be substantially enhanced by a parallel programming model called migratable objects. In this paper, we demonstrate how the migratable-objects model facilitates and improves several fault tolerance approaches. Our experimental results on thousands of cores suggest that fault tolerance schemes based on migratable objects have low performance overhead and high scalability. Additionally, we present a performance model that predicts a significant benefit of using migratable objects to provide fault tolerance at extreme scale.
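    The migratable-objects idea pairs naturally with in-memory (buddy) checkpointing: an object that can pack its own state for migration can reuse the same mechanism to keep a checkpoint copy on another node and be recreated there after a failure. The sketch below is only illustrative and is not the Charm++ API; the MigratableObject, serialize(), and BuddyStore names are assumptions for this example.

```cpp
// Illustrative sketch only (not the Charm++ API): a migratable object that can
// flatten its state so the runtime could checkpoint it on a buddy node and
// recreate it elsewhere after a failure. All names here are assumptions.
#include <cstddef>
#include <cstring>
#include <map>
#include <vector>

struct MigratableObject {
    int id = 0;
    std::vector<double> state;

    // Pack the object into a flat buffer (similar in spirit to the packing
    // used for both migration and checkpointing of migratable objects).
    std::vector<char> serialize() const {
        std::vector<char> buf(sizeof(int) + state.size() * sizeof(double));
        std::memcpy(buf.data(), &id, sizeof(int));
        std::memcpy(buf.data() + sizeof(int), state.data(),
                    state.size() * sizeof(double));
        return buf;
    }

    static MigratableObject deserialize(const std::vector<char>& buf) {
        MigratableObject obj;
        std::memcpy(&obj.id, buf.data(), sizeof(int));
        std::size_t n = (buf.size() - sizeof(int)) / sizeof(double);
        obj.state.resize(n);
        std::memcpy(obj.state.data(), buf.data() + sizeof(int),
                    n * sizeof(double));
        return obj;
    }
};

// Toy double in-memory checkpoint store: each object keeps one checkpoint copy
// locally and one on a partner ("buddy") node; after a crash the surviving
// copy restores the object on any remaining node.
struct BuddyStore {
    std::map<int, std::vector<char>> local, buddy;

    void checkpoint(const MigratableObject& o) {
        std::vector<char> buf = o.serialize();
        local[o.id] = buf;   // copy kept on the owning node
        buddy[o.id] = buf;   // copy that would be sent to the buddy node
    }

    MigratableObject restore(int id) const {
        return MigratableObject::deserialize(buddy.at(id));
    }
};
```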

    Hierarchical Load Balancing for Charm++ Applications on Large Supercomputers

    Large parallel machines with hundreds of thousands of processors are being built. Recent studies have shown that ensuring good load balance is critical for scaling certain classes of parallel applications, even on thousands of processors. Centralized load balancing algorithms suffer from scalability problems, especially on machines with a relatively small amount of memory. Fully distributed load balancing algorithms, on the other hand, tend to yield poor load balance on very large machines. In this paper, we present an automatic dynamic hierarchical load balancing method that overcomes the scalability challenges of centralized schemes and the poor solutions of traditional distributed schemes. This is done by creating multiple levels of aggressive load balancing domains which form a tree. This hierarchical method is demonstrated within a measurement-based load balancing framework in Charm++. We present techniques to deal with the scalability challenges of load balancing at very large scale. We show performance data for the hierarchical load balancing method on up to 16,384 cores of Ranger (at TACC) for a synthetic benchmark, and we demonstrate the successful deployment of the method in a scientific application, NAMD, with results on the Blue Gene/P machine at ANL.
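    To make the tree-of-domains idea concrete, here is a minimal two-level sketch (not the Charm++ implementation): first, whole objects are shifted between domains using only aggregate domain loads, then each domain greedily maps its own objects onto its own processors. The group structure and the greedy policies are assumptions chosen for brevity.

```cpp
// Illustrative two-level hierarchical load balancer (not the Charm++ code):
// level 1 moves whole objects between domains using only aggregate domain
// loads; level 2 maps each domain's objects onto its own processors greedily.
#include <algorithm>
#include <functional>
#include <numeric>
#include <queue>
#include <utility>
#include <vector>

using Loads = std::vector<double>;   // measured load of each migratable object

// Level 2: within one domain, place the heaviest objects first onto the
// currently least loaded processor (a classic greedy heuristic).
std::vector<int> balanceWithinDomain(const Loads& objs, int nprocs) {
    std::vector<int> assign(objs.size());
    std::priority_queue<std::pair<double, int>,
                        std::vector<std::pair<double, int>>,
                        std::greater<>> procs;        // min-heap of (load, proc)
    for (int p = 0; p < nprocs; ++p) procs.push({0.0, p});

    std::vector<int> order(objs.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return objs[a] > objs[b]; });

    for (int i : order) {
        auto [load, p] = procs.top(); procs.pop();
        assign[i] = p;
        procs.push({load + objs[i], p});
    }
    return assign;
}

// Level 1: shift objects from overloaded to underloaded domains until every
// domain is close to the average; only aggregate loads cross domain borders.
void balanceAcrossDomains(std::vector<Loads>& domains) {
    double total = 0;
    for (const Loads& d : domains) for (double x : d) total += x;
    const double avg = total / domains.size();

    for (Loads& src : domains) {
        double srcLoad = std::accumulate(src.begin(), src.end(), 0.0);
        for (Loads& dst : domains) {
            if (&src == &dst) continue;
            double dstLoad = std::accumulate(dst.begin(), dst.end(), 0.0);
            while (srcLoad > avg && dstLoad < avg && !src.empty()) {
                double moved = src.back();
                src.pop_back();
                dst.push_back(moved);                 // migrate one object
                srcLoad -= moved;
                dstLoad += moved;
            }
        }
    }
}
```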

    Enabling and scaling biomolecular simulations of 100 million atoms on petascale machines with a multicore-optimized message-driven runtime

    A 100-million-atom biomolecular simulation with NAMD is one of the three benchmarks for the NSF-funded sustainable petascale machine. Simulating this large molecular system on a petascale machine presents great challenges, including handling I/O, the large memory footprint, and achieving good strong-scaling results. In this paper, we present parallel I/O techniques to enable the simulation. A new SMP model is designed to efficiently utilize ubiquitous wide multicore clusters by extending the CHARM++ asynchronous message-driven runtime. We exploit node-aware techniques to optimize both the application and the underlying SMP runtime. Hierarchical load balancing is further exploited to scale NAMD to the full Jaguar PF Cray XT5 (224,076 cores) at Oak Ridge National Laboratory, both with and without PME full electrostatics, achieving 93% parallel efficiency (vs. 6720 cores) at 9 ms per step for a simple cutoff calculation. Excellent scaling is also obtained on 65,536 cores of the Intrepid Blue Gene/P at Argonne National Laboratory.
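    A minimal sketch of the SMP execution model referred to above, under the assumption of one process per node with several worker threads: each worker drains its own message queue, intra-node sends push directly into the destination worker's queue, and a dedicated communication thread would feed the queues with off-node messages. This is illustrative standard C++, not the CHARM++ runtime.

```cpp
// Illustrative sketch of an SMP, message-driven scheduler loop in standard
// C++ (not the CHARM++ runtime). A "message" is modeled as a callable; each
// worker thread repeatedly picks the next message from its queue and runs it.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <utility>

struct WorkerQueue {
    std::queue<std::function<void()>> q;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    // Deliver a message: called directly for intra-node sends, or by the
    // node's communication thread for messages arriving from other nodes.
    void push(std::function<void()> msg) {
        { std::lock_guard<std::mutex> lk(m); q.push(std::move(msg)); }
        cv.notify_one();
    }

    // Scheduler loop run by one worker thread: message-driven execution.
    void run() {
        for (;;) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return done || !q.empty(); });
            if (q.empty()) return;           // shutdown requested and drained
            std::function<void()> msg = std::move(q.front());
            q.pop();
            lk.unlock();
            msg();                           // execute the message's handler
        }
    }

    void shutdown() {
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_all();
    }
};
```
    Each worker would execute run() on its own std::thread; the node's communication thread polls the network and calls push() on the target worker's queue, so intra-node traffic never crosses the network interface.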

    NAMD: biomolecular simulation on thousands of processors

    NAMD is a fully featured, production molecular dynamics program for high performance simulation of large biomolecular systems. We have previously, at SC2000, presented scaling results for simulations with cutoff electrostatics on up to 2048 processors of the ASCI Red machine, achieved with an object-based hybrid force and spatial decomposition scheme and an aggressive measurement-based predictive load balancing framework. We extend this work by demonstrating similar scaling on the much faster processors of the PSC Lemieux Alpha cluster, and for simulations employing efficient (order N log N) particle mesh Ewald full electrostatics. This unprecedented scalability in a biomolecular simulation code has been attained through latency tolerance, adaptation to multiprocessor nodes, and the direct use of the Quadrics Elan library in place of MPI by the Charm++/Converse parallel runtime system.
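    The object-based hybrid force and spatial decomposition mentioned above can be pictured as follows: space is cut into patches at least one cutoff wide, and a separate compute object is created for every pair of neighboring patches; these compute objects are the units the measurement-based load balancer redistributes. The sketch below is an illustration in standard C++ with an assumed data layout, not NAMD code.

```cpp
// Illustrative sketch (not NAMD code) of a cutoff-based spatial decomposition:
// atoms are binned into patches at least one cutoff wide, and a compute task
// is created for every pair of neighboring patches, since only those pairs can
// contain atoms within the cutoff distance of each other.
#include <array>
#include <cmath>
#include <cstdlib>
#include <map>
#include <utility>
#include <vector>

struct Atom { double x, y, z; };
using Cell = std::array<int, 3>;             // integer coordinates of a patch

// Assign every atom to the patch (cell) that contains it.
std::map<Cell, std::vector<int>> makePatches(const std::vector<Atom>& atoms,
                                             double cutoff) {
    std::map<Cell, std::vector<int>> patches;
    for (int i = 0; i < static_cast<int>(atoms.size()); ++i) {
        Cell c = { static_cast<int>(std::floor(atoms[i].x / cutoff)),
                   static_cast<int>(std::floor(atoms[i].y / cutoff)),
                   static_cast<int>(std::floor(atoms[i].z / cutoff)) };
        patches[c].push_back(i);
    }
    return patches;
}

// One compute task per pair of patches that differ by at most one cell in
// every dimension (including a patch paired with itself). These tasks are the
// migratable work units a measurement-based load balancer can redistribute.
std::vector<std::pair<Cell, Cell>>
makeComputes(const std::map<Cell, std::vector<int>>& patches) {
    std::vector<std::pair<Cell, Cell>> computes;
    for (auto a = patches.begin(); a != patches.end(); ++a)
        for (auto b = a; b != patches.end(); ++b) {
            bool neighbor = true;
            for (int d = 0; d < 3; ++d)
                neighbor = neighbor && std::abs(a->first[d] - b->first[d]) <= 1;
            if (neighbor) computes.push_back({a->first, b->first});
        }
    return computes;
}
```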

    Simulating Large Scale Parallel Applications Using Statistical Models for Sequential Execution Blocks

    Predicting the sequential execution blocks of a large-scale parallel application is an essential part of accurately predicting the overall performance of the application. When simulating a future machine that has not yet been fabricated, or a prototype system available only at a small scale, this becomes a significant challenge. Using hardware simulators may not be feasible because of excessively slow execution and insufficient resources, and these issues become more difficult as the scale of the simulation grows. In this paper, we propose an approach based on statistical models to accurately predict the performance of the sequential execution blocks that comprise a parallel application. We deploy these techniques in a trace-driven simulation framework to capture both the detailed behavior of the application and the overall predicted performance. The technique is validated using both synthetic benchmarks and the NAMD application. Index Terms: parallel simulator, performance prediction, trace-driven, machine learning, statistical model.
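    As a concrete illustration of per-block statistical modeling, the sketch below fits a simple least-squares linear model of a block's execution time against a single feature (for example an iteration count or message size, an assumption made here) and uses it to supply block durations during trace-driven simulation. It is a minimal example, not the framework's actual model.

```cpp
// Illustrative per-block performance model (not the framework's actual model):
// fit time = a + b * feature by ordinary least squares from timings collected
// on an available machine, then use predict() to supply block durations during
// trace-driven simulation of the target machine. Assumes at least two samples
// with distinct feature values.
#include <cstddef>
#include <vector>

struct LinearModel {
    double a = 0.0, b = 0.0;                 // time = a + b * feature

    static LinearModel fit(const std::vector<double>& feature,
                           const std::vector<double>& time) {
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        const std::size_t n = feature.size();
        for (std::size_t i = 0; i < n; ++i) {
            sx  += feature[i];
            sy  += time[i];
            sxx += feature[i] * feature[i];
            sxy += feature[i] * time[i];
        }
        LinearModel m;
        m.b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        m.a = (sy - m.b * sx) / n;
        return m;
    }

    double predict(double feature) const { return a + b * feature; }
};
```
    In a trace-driven replay, each occurrence of the block in the trace would then be charged predict(feature) instead of the time measured on the small-scale host machine.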

    Automatic MPI to AMPI Program Transformation Using Photran

    Adaptive MPI, or AMPI, is an implementation of the Message Passing Interface (MPI) standard. AMPI benefits MPI applications with features such as dynamic load balancing, virtualization, and checkpointing. Because AMPI uses multiple user-level threads per physical core, global variables become an obstacle. It is thus necessary to convert MPI programs to AMPI by eliminating global variables. Manually removing the global variables in a program is tedious and error-prone. In this paper, we present a Photran-based tool that automates this task with a source-to-source transformation that supports Fortran. We evaluate our tool on the multi-zone NAS Benchmarks with AMPI. We also demonstrate the tool on the real-world, large-scale FLASH code and present preliminary results of running FLASH on AMPI. Both results show significant performance improvement using AMPI. This demonstrates that the tool makes using AMPI easier and more productive.
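    The effect of such a transformation can be illustrated with the usual privatization pattern: every former global becomes a field of a per-thread state object passed through the call chain, so each user-level MPI rank gets its own copy. The sketch below shows the pattern in C++ for brevity; the actual tool performs the analogous rewrite on Fortran sources via Photran.

```cpp
// Illustrative privatization pattern in C++ (the actual tool performs the
// analogous rewrite on Fortran sources). Before: a global shared by every
// user-level rank in the process. After: the former global lives in a
// per-rank state object passed through the call chain.
#include <iostream>

// --- before: unsafe once several user-level MPI ranks share one process ---
// int iteration;                  // a single copy shared by all ranks
// void step() { ++iteration; }

// --- after: one ThreadState per user-level rank, passed explicitly ---
struct ThreadState {
    int iteration = 0;              // was a global variable
};

void step(ThreadState& ts) {        // every former use of the global now goes
    ++ts.iteration;                 // through the per-rank state object
}

int main() {
    ThreadState rank0, rank1;       // two user-level "ranks" in one process
    step(rank0); step(rank0); step(rank1);
    std::cout << rank0.iteration << " " << rank1.iteration << "\n";  // 2 1
    return 0;
}
```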

    Achieving High Performance on Extremely Large Parallel Machines: Performance Prediction and Load Balancing

    Parallel machines with an extremely large number of processors (tens of thousands of processors or more) are now in operation. For example, the IBM BlueGene/L machine with 128K processors is currently being deployed. It is going to be a significant challenge for application developers to write parallel programs that exploit the enormous compute power available and to manually scale their applications on such machines. Solving these problems involves finding suitable parallel programming models for such machines and addressing issues like load imbalance. In this thesis, we explore the Charm++ programming model and its migratable objects for programming such machines, along with dynamic load balancing techniques that help parallel applications scale easily to a large number of processors. We also present a parallel simulator capable of predicting parallel performance, which helps with analysis and tuning of parallel performance and facilitates the development of new load balancing techniques, even before such machines are built. We evaluate the idea of virtualization and its usefulness in helping a programmer write applications with a high degree of parallelism, and we demonstrate it by developing several mini-applications with million-way parallelism. We show that Charm++ and AMPI (an extension of MPI) with migratable objects and support for load balancing are suitable programming models for very large machines. It is also important to understand the performance of parallel applications on very large parallel machines. This thesis explores Parallel Discrete Event Simulation (PDES) techniques with an optimistic synchronization protocol to simulate parallel applications running on a very large number of processors. We optimize the synchronization protocol by exploiting the inherent determinacy normally found in parallel applications, which reduces the synchronization overhead significantly.
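    The optimistic synchronization protocol mentioned above can be sketched as follows: a logical process executes events speculatively in timestamp order, saving its state before each one; when a straggler arrives with an earlier timestamp, it rolls back to a saved state and re-executes the undone events. The code below is a minimal single-process illustration with assumed event and state types; anti-messages and the determinacy-based optimization are omitted.

```cpp
// Illustrative optimistic PDES kernel for a single logical process (the event
// and state types are assumptions, and anti-messages are omitted): events are
// executed speculatively in timestamp order with the state saved beforehand;
// a straggler with an earlier timestamp triggers a rollback and re-execution.
#include <iterator>
#include <map>
#include <queue>
#include <utility>
#include <vector>

struct Event { double ts; int delta; };
struct Later { bool operator()(const Event& a, const Event& b) const
               { return a.ts > b.ts; } };

struct LogicalProcess {
    double now = 0.0;                         // local virtual time
    int state = 0;
    std::map<double, std::pair<int, Event>> processed;  // ts -> (state before, event)
    std::priority_queue<Event, std::vector<Event>, Later> pending;

    void schedule(const Event& e) {
        if (e.ts < now) rollback(e.ts);       // straggler: undo speculation
        pending.push(e);
    }

    // Restore the snapshot taken before the first event at or after ts and put
    // the undone events back into the pending queue for re-execution.
    void rollback(double ts) {
        auto it = processed.lower_bound(ts);
        if (it == processed.end()) return;
        state = it->second.first;
        now = (it == processed.begin()) ? 0.0 : std::prev(it)->first;
        for (auto u = it; u != processed.end(); ++u) pending.push(u->second.second);
        processed.erase(it, processed.end());
    }

    void run() {                              // speculative, in-order execution
        while (!pending.empty()) {
            Event e = pending.top(); pending.pop();
            processed[e.ts] = {state, e};     // snapshot before processing
            state += e.delta;                 // "process" the event
            now = e.ts;
        }
    }
};
```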